Forecasting: Principles and Practice Chapter 7 - Time Series Regression Models

R. J. Serrano

03/02/2023

Learning objectives:

Introduction —

In this chapter, we will discuss regression models. The basic concept is that we forecast the time series of interest \(y\) assuming that it has a linear relationship with other time series \(x\).

7.1 - The linear model —

In the simplest case, the regression model allows for a linear relationship between the forecast variable \(y\) and a single predictor \(x\):
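With \(\beta_0\) the intercept, \(\beta_1\) the slope, and \(\varepsilon_t\) the error term, the simple linear model is

\[
y_t = \beta_0 + \beta_1 x_t + \varepsilon_t .
\]

The error \(\varepsilon_t\) captures everything affecting \(y_t\) other than \(x_t\).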

Figure 7.1: An example of data from a simple linear regression model

Example: US consumption expenditure

Let’s plot the time series of quarterly changes (growth rates) of real personal consumption expenditure (\(y\)) and real personal disposable income (\(x\)) for the U.S. from 1970 Q1 to 2019 Q2.

library(fpp3)

data("us_change")
us_change

us_change %>% 
     pivot_longer(c(Consumption, Income), 
                  names_to = 'Series') %>% 
     autoplot(value) + 
     labs(y = '% change')

Using the base R lm() function to estimate the intercept and slope coefficients.

us_chg_tbl <- us_change %>% 
     as_tibble() %>% 
     select(Consumption, Income)

model_lm <- lm(Consumption ~ Income, data = us_chg_tbl)
summary(model_lm)
## 
## Call:
## lm(formula = Consumption ~ Income, data = us_chg_tbl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58236 -0.27777  0.01862  0.32330  1.42229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
## Income       0.27183    0.04673   5.817  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared:  0.1472, Adjusted R-squared:  0.1429 
## F-statistic: 33.84 on 1 and 196 DF,  p-value: 2.402e-08

Scatterplot with fitted regression line.

us_change |>
     ggplot(aes(x = Income, y = Consumption)) +
     labs(y = "Consumption (quarterly % change)",
          x = "Income (quarterly % change)") +
     geom_point() +
     geom_smooth(method = "lm", se = FALSE)

Cross-correlation between the two time series.

# base R
ccf(us_chg_tbl$Consumption, us_chg_tbl$Income)

us_change %>% 
     CCF(Consumption, Income) %>% 
     autoplot()

The cross-correlation plot can be used to identify which lags of the predictor may be useful in the regression.

Multiple regression

us_change |> 
     
     # drop `Consumption`, `Income` columns
     select(-Consumption, -Income) |>
     pivot_longer(-Quarter) |>
     ggplot(aes(Quarter, value, colour = name)) +
     geom_line() +
     facet_grid(name ~ ., scales = "free_y") +
     guides(colour = "none") +
     labs(y="% change")

Pair plots for us_change

library(GGally)
us_change %>% 
     ggpairs(columns = 2:6, 
             lower = list(
                  continuous = wrap('smooth', 
                                    color = 'steelblue', 
                                    alpha = 0.3, 
                                    size = 0.3)
                  )
             )

Assumptions about the errors (residuals)

We assume the errors have mean zero, are not autocorrelated, and are unrelated to the predictors. It is also useful (though not required) for the errors to be normally distributed with constant variance.

7.2 - Least squares estimation —

The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors.
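For a model with predictors \(x_{1,t},\dots,x_{k,t}\), least squares chooses the coefficients \(\beta_0,\dots,\beta_k\) that minimise

\[
\sum_{t=1}^{T} \varepsilon_t^2 = \sum_{t=1}^{T} \left( y_t - \beta_0 - \beta_1 x_{1,t} - \cdots - \beta_k x_{k,t} \right)^2 .
\]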

Example: US consumption expenditure

fit_consMR <- us_change |>
     model(tslm = TSLM(Consumption ~ Income + Production +
                                    Unemployment + Savings))
report(fit_consMR)
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  < 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16

Fitted values

augment(fit_consMR) |>
     ggplot(aes(x = Quarter)) +
     geom_line(aes(y = Consumption, colour = "Data")) +
     geom_line(aes(y = .fitted, colour = "Fitted")) +
     labs(y = NULL,
          title = "Percent change in US consumption expenditure"
     ) +
     scale_colour_manual(values=c(Data="black",Fitted="#D55E00")) +
     guides(colour = guide_legend(title = NULL))

augment(fit_consMR) |>
     ggplot(aes(x = Consumption, y = .fitted)) +
     geom_point() +
     labs(
          y = "Fitted (predicted values)",
          x = "Data (actual values)",
          title = "Percent change in US consumption expenditure"
     ) +
     geom_abline(intercept = 0, slope = 1, color = 'red')

Goodness-of-fit

A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, \(R^2\). For the example above, \(R^2 = 0.768\).
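The \(R^2\) statistic is the square of the correlation between the observed and fitted values, which for a model with an intercept equals

\[
R^2 = \frac{\sum_t \left( \hat{y}_t - \bar{y} \right)^2}{\sum_t \left( y_t - \bar{y} \right)^2} .
\]

It lies between 0 and 1; values closer to 1 indicate a better fit to the historical data.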

7.3 - Evaluating the regression model —

Residual plots

Using the gg_tsresiduals() function introduced in Section 5.3, we can obtain the standard residual diagnostic plots: a time plot, an ACF plot, and a histogram of the residuals.

fit_consMR %>% 
     gg_tsresiduals()

augment(fit_consMR) |> 
     features(.innov, ljung_box, lag = 10)
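The ljung_box feature computes the Ljung-Box statistic, which tests whether the first \(\ell\) residual autocorrelations \(r_1,\dots,r_\ell\) are consistent with white noise (here \(\ell = 10\), and \(T\) is the number of observations):

\[
Q^* = T(T+2) \sum_{k=1}^{\ell} \frac{r_k^2}{T-k} .
\]

A large \(Q^*\) (small p-value) suggests the autocorrelations do not come from a white noise series.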

The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.

The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.

The autocorrelation plot shows a significant spike at lag 7, and a significant Ljung-Box test at the 5% level. However, the autocorrelation is not particularly large, and at lag 7 it is unlikely to have any noticeable impact on the forecasts or the prediction intervals.

Residual plot against predictors

The residuals from the multiple regression model for forecasting US consumption plotted against each predictor in Figure 7.9 seem to be randomly scattered. Therefore we are satisfied with these in this case.

us_change |>
     left_join(residuals(fit_consMR), by = "Quarter") |>
     pivot_longer(Income:Unemployment,
                  names_to = "regressor", values_to = "x") |>
     ggplot(aes(x = x, y = .resid)) +
     geom_point() +
     facet_wrap(. ~ regressor, scales = "free_x") +
     labs(y = "Residuals", x = "")

Residual plots against fitted values

A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors which means that the variance of the residuals may not be constant.

augment(fit_consMR) |>
     ggplot(aes(x = .fitted, y = .resid)) +
     geom_point() + labs(x = "Fitted", y = "Residuals")

Outliers and influential observations

Observations that take extreme values compared to the majority of the data are called outliers. Observations that have a large influence on the estimated coefficients of a regression model are called influential observations. Usually, influential observations are also outliers that are extreme in the \(x\) direction.

Spurious regressions

Regressing non-stationary time series can lead to spurious regressions: two unrelated series that are both trending over time can appear strongly related. High \(R^2\) combined with high residual autocorrelation is a sign of spurious regression, and such spurious correlations make the usual goodness-of-fit measures misleading.